Path: bloom-beacon.mit.edu!senator-bedfellow.mit.edu!faqserv
From: andrewh@speech.su.oz.au (Andrew Hunt)
Newsgroups: comp.speech,comp.answers,news.answers
Subject: comp.speech Frequently Asked Questions - part 3/3
Supersedes: <comp-speech-faq/part3_764040899@rtfm.mit.edu>
Followup-To: comp.speech
Date: 16 Apr 1994 13:08:05 GMT
Organization: Speech Technology Group, The University of Sydney
Lines: 996
Approved: news-answers-request@MIT.Edu
Expires: 28 May 1994 13:05:48 GMT
Message-ID: <comp-speech-faq/part3_766501548@rtfm.mit.edu>
References: <comp-speech-faq/part1_766501548@rtfm.mit.edu>
Reply-To: andrewh@speech.su.oz.au (Andrew Hunt)
NNTP-Posting-Host: bloom-picayune.mit.edu
Summary: Useful information about Speech Technology
X-Last-Updated: 1994/04/06
Originator: faqserv@bloom-picayune.MIT.EDU
Xref: bloom-beacon.mit.edu comp.speech:2285 comp.answers:4934 news.answers:18148
Archive-name: comp-speech-faq/part3
Last-modified: 1994/04/06
SECTION 5 - Speech Synthesis
Q5.1: What is speech synthesis?
Speech synthesis is the task of transforming written input to spoken output.
The input can either be provided in a graphemic/orthographic or a phonemic
script, depending on its source.
------------------------------------------------------------------------
Q5.2: How can speech synthesis be performed?
There are several algorithms. The choice depends on the task they're used
for. The easiest way is to just record the voice of a person speaking the
desired phrases. This is useful if only a restricted volume of phrases and
sentences is used, e.g. messages in a train station, or schedule information
via phone. The quality depends on the way recording is done.
More sophisticated, but lower in quality, are algorithms which split the
speech into smaller pieces. The smaller those units are, the fewer of them
are needed, but the quality also decreases. An often used unit is the phoneme,
the smallest linguistic unit. Depending on the language used there are about
35-50 phonemes in western European languages, i.e. there are 35-50 single
recordings. The problem is combining them, as fluent speech requires fluent
transitions between the elements. The intelligibility is therefore lower, but
the memory required is small.
A solution to this dilemma is using diphones. Instead of splitting at the
transitions, the cut is done at the center of the phonemes, leaving the
transitions themselves intact. This gives about 400 elements (20*20) and
the quality increases.
The longer the units become, the more elements there are, but the quality
increases along with the memory required. Other units which are widely used
are half-syllables, syllables, words, or combinations of them, e.g. word stems
and inflectional endings.
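The diphone idea can be illustrated with a few lines of Python. This is
only a sketch under stated assumptions: the function names, the "_" silence
marker and the phoneme labels are invented for illustration and do not come
from any of the packages listed in this FAQ.

```python
def to_diphones(phonemes):
    """Turn a phoneme sequence into diphone labels.

    Each diphone runs from the centre of one phoneme to the centre of the
    next, so the transition between the two stays inside a single unit.
    "_" is a hypothetical marker for silence at either end of the utterance.
    """
    padded = ["_"] + list(phonemes) + ["_"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def inventory_size(n_phonemes):
    """Upper bound on the number of distinct diphones for n phonemes."""
    return n_phonemes * n_phonemes


print(to_diphones(["k", "ae", "t"]))  # ['_-k', 'k-ae', 'ae-t', 't-_']
print(inventory_size(20))             # 400
```

A real system would look up a recorded waveform for each label and join
them; the sketch only shows how the unit inventory is derived and why it
grows with the square of the phoneme count.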
------------------------------------------------------------------------
Q5.3: What are some good references/books on synthesis?
The following are good introductory books/articles.
Douglas O'Shaughnessy -- Speech Communication: Human and Machine
Addison Wesley series in Electrical Engineering: Digital Signal Processing,
1987.
D. H. Klatt, "Review of Text-To-Speech Conversion for English", Jnl. of
the Acoustical Society of America (JASA), v82, Sept. 1987, pp 737-793.
I. H. Witten. Principles of Computer Speech.
(London: Academic Press, Inc., 1982).
John Allen, Sharon Hunnicutt and Dennis H. Klatt, "From Text to Speech:
The MITalk System", Cambridge University Press, 1987.
------------------------------------------------------------------------
Q5.4: What software/hardware is available?
In the last year there has been a great increase in the release of speech
synthesis software - both commercial and public domain. The following is
a list of as many products/packages as I can find out about. Any help in
keeping this list up-to-date will be appreciated.
Package: ORATOR Text-to-Speech Synthesizer
Platform: SUN SPARC, Decstation 5000. Portable to other UNIX platforms.
Description: Sophisticated speech synthesis package. Has text preprocessing
(for abbreviations, numbers), acronym citation rules, and human-like
spelling routines. High accuracy for pronunciation of names of
people, places and businesses in America, text-to-speech translation
for common words; rules for stress and intonation marking, based on
natural-sounding demisyllable synthesis; various methods of user
control and customization at most stages of processing. Currently,
ORATOR is most appropriate for applications containing a large
component of names in the text, and requires some amount of user-
specified text-preprocessing to produce good quality speech for
general text.
Hardware: Standard audio output of SPARC, or Decstation audio hardware.
At least 16M of memory recommended.
Cost: Binary License: $5,000.
Source license for porting or commercial use: $30,000.
Availability: Contact Bellcore's Licensing Office (1-800-527-1080)
or email: jzilg@cc.bellcore.com (John Zilg)
Package: Text to phoneme program (1)
Platform: unknown
Description: Text to phoneme program. Based on Naval Research Lab's
set of text to phoneme rules.
Availability: By FTP from "shark.cse.fau.edu" (131.91.80.13) in the directory
/pub/src/phon.tar.Z
Package: Text to phoneme program (2)
Platform: unknown
Description: Text to phoneme program.
Availability: By FTP from "wuarchive.wustl.edu" in the file
/mirrors/unix-c/utils/phoneme.c
Package: Text to phoneme program (3)
Description: A public domain version of the same Naval Research Lab
text to phoneme rules.
Availability: By anonymous ftp from
svr-ftp.eng.cam.ac.uk:comp.speech/sources/english2phoneme.shar
Package: Text to speech program
Description: An implementation of the Klatt phoneme to waveform speech
synthesiser.
Availability: By anonymous ftp from
svr-ftp.eng.cam.ac.uk:comp.speech/sources/klatt-0.02.tar.Z
Package: "Speak" - a Text to Speech Program
Platform: Sun SPARC
Description: Text to speech program based on concatenation of pre-recorded
speech segments. A function library can be used to integrate
speech output into other code.
Hardware: SPARC audio I/O
Availability: by FTP from "wilma.cs.brown.edu" as /pub/speak.tar.Z
Package: TheBigMouth - a Text to Speech Program
Platform: NeXT
Description: Text to speech program based on concatenation of pre-recorded
speech segments. NeXT equivalent of "Speak" for Suns.
Availability: try NeXT archive sites such as sonata.cc.purdue.edu.
Package: TextToSpeech Kit
Platform: NeXT Computers
Description: The TextToSpeech Kit does unrestricted conversion of English
text to synthesized speech in real-time. The user has control over
speaking rate, median pitch, stereo balance, volume, and intonation
type. Text of any length can be spoken, and messages can be queued
up, from multiple applications if desired. Real-time controls such
as pause, continue, and erase are included. Pronunciations are
derived primarily by dictionary look-up. The Main Dictionary has
nearly 100,000 hand-edited pronunciations which can be supplemented
or overridden with the User and Application dictionaries. A number
parser handles numbers in any form. A letter-to-sound knowledge base
provides pronunciations for words not in the Main or customized
dictionaries. Dictionary search order is under user control.
Special modes of text input are available for spelling and emphasis
of words or phrases. The actual conversion of text to speech is done
by the TextToSpeech Server. The Server runs as an independent task
in the background, and can handle up to 50 client connections.
Misc: The TextToSpeech Kit comes in two packages: the Developer Kit and the
User Kit. The Developer Kit enables developers to build and test
applications which incorporate text-to-speech. It includes the
TextToSpeech Server, the TextToSpeech Object, the pronunciation
editor PrEditor, several example applications, phonetic fonts,
example source code, and developer documentation. The User Kit
provides support for applications which incorporate text-to-speech.
It is a subset of the Developer Kit.
Hardware: Uses standard NeXT Computer hardware.
Cost: TextToSpeech User Kit: $175 CDN ($145 US)
TextToSpeech Developer Kit: $350 CDN ($290 US)
Upgrade from User to Developer Kit: $175 CDN ($145 US)
Availability: Trillium Sound Research
1500, 112 - 4th Ave. S.W., Calgary, Alberta, Canada, T2P 0H3
Tel: (403) 284-9278 Fax: (403) 282-6778
Order Desk: 1-800-L-ORATOR (US and Canada only)
Email: TTSInfo@trillium.ab.ca
Package: SGI Developers Toolbox Synthesiser
Platform: SGI
Description: The SGI Developer Toolbox 4.0 CDROM contains a basic
public domain text-to-speech program in the publics/speak
directory. The directory includes man pages and source.
Availability: on the SGI Developer Toolbox 4.0 CDROM
Package: rsynth
Platform: Various (including Sun, Linux, NeXT, SGI)
Description: Text-to-speech converter produced by combination of
various public-domain pieces.
Price: Free
Availability: by anonymous ftp from
svr-ftp.eng.cam.ac.uk:/comp.speech/sources/rsynth-1.0.tar.Z
svr-ftp.eng.cam.ac.uk:/comp.speech/sources/rsynth-1.0.tar.gz
Package: SENSYN speech synthesizer
Platform: PC, Mac, Sun, and NeXt
Rough Cost: $300
Description: This formant synthesizer produces speech waveform files
based on the (Klatt) KLSYN88 synthesizer. It is intended
for laboratory and research use. Note that this is NOT a
text-to-speech synthesizer, but creates speech sounds based
upon a large number of input variables (formant frequencies,
bandwidths, glottal pulse characteristics, etc.) and would
be used as part of a TTS system. Includes full source code.
Availability: Sensimetrics Corporation, 64 Sidney Street, Cambridge MA 02139.
Fax: (617) 225-0470; Tel: (617) 225-2442.
Email: sensimetrics@sens.com
Package: SPCHSYN.EXE
Platform: PC?
Availability: By anonymous ftp from evans.ee.adfa.oz.au (131.236.30.24)
in /mirrors/tibbs/Applications/SPCHSYN.EXE
It is a self extracting DOS archive.
Requirements: May require special TI product(s), but all source is there.
Package: CSRE: Canadian Speech Research Environment
Platform: PC
Cost: Distributed on a cost recovery basis
Description: CSRE is a software system which includes in addition to the
Klatt speech synthesizer, SPEECH ANALYSIS and EXPERIMENT CONTROL
SYSTEM. A paper about the whole package can be found in:
Jamieson D.G. et al, "CSRE: A Speech Research Environment", Proc.
of the Second Intl. Conf. on Spoken Language Processing, Edmonton:
University of Alberta, pp. 1127-1130.
Hardware: Can use a range of data acquisition/DSP
Availability: For more information about the availability of this software
contact Krystyna Marciniak - email march@uwovax.uwo.ca
Tel (519) 661-3901 Fax (519) 661-3805.
For technical information email ramji@uwovax.uwo.ca
Note: A more detailed description is given in Q1.8 on speech environments.
Package: Eloquence (currently an alpha release)
Platform: Windows and Solaris
Description: Software based text-to-speech package. Generates waveforms
completely algorithmically instead of by concatenating waveforms,
for maximum flexibility and naturalism. For instance, when the
user requests a deeper voice, the software simulates a larger vocal
tract, instead of simply pitch-shifting samples.
Uses high-level linguistic parsing, which obviates the need for a
huge dictionary. Handles numbers, acronyms, currency, etc.
Includes a set of annotation symbols, for placing stress on particular
words, expressing excitement/boredom, etc. Also allows phonetic input.
The final version, including support for Windows DDE and OLE and
UNIX Sockets, will be released by the end of 1994.
Produces male and female voices for General American English.
Dialects under development include Alabama, Brooklyn, and Boston.
Price: $5000 (unconfirmed)
Availability: Eloquent Technology, Inc.
24 Highgate Circle
Ithaca, NY 14850
Ph: (607) 257-6829 Fax: (607) 272-0058
Package: JSRU
Platform: UNIX and PC
Cost: 100 pounds sterling (from academic institutions and industry)
Description: A C version of the JSRU system, Version 2.3 is available.
It's written in Turbo C but runs on most Unix systems with very
little modification. A Form of Agreement must be signed to say
that the software is required for research and development only.
Contact: Dr. E.Lewis (eric.lewis@uk.ac.bristol)
Package: Klatt-style synthesiser
Platform: Unix
Cost: Free
Description: Software posted to comp.speech in late 1992.
Availability: By anonymous ftp from the comp.speech archives as
svr-ftp.eng.cam.ac.uk:/comp.speech/sources/klatt-0.02.tar.Z
Package: Speech Manager and PlainTalk
Platform: Macintosh
Cost: Free
Description: Apple's new text-to-speech system extension(s) that enable
applications (listed below) to perform text-to-speech
conversion. The Speech Manager runs on most Macs, but PlainTalk
(and the high quality voices) requires a 68020 Mac or better.
Availability: By anonymous ftp from:
ftp.apple.com:/dts/mac/sys.soft/speech
There are 3 files in this directory:
6273632 Aug 14 22:51 macintalk-pro.hqx
PlainTalk Text-To-Speech 1.0 speech synthesizer
extension (includes Female Voice, Compressed);
TTS Female Voice; TTS Male Voice; and
TTS Male Voice, Compressed. Requires 68020 or better!
370108 Aug 13 04:30 speech-manager-docs.hqx
Apple DocViewer format (Inside Macintosh style,
no installation instructions - just drag everything
onto your closed System Folder).
262569 Aug 7 07:01 speech-manager.hqx
Speech Manager 1.1.1 (includes Marvin's voice) and
MacInTalk Voices 1.1.1 (9 more voices). Runs most Macs.
Package: Various Mac Speech Output Applications
Platform: Macintosh
Cost: Free (except for At Ease)
Description: Some of the Speech Manager aware text-to-speech (TTS)
applications, etc. are listed below (there are more on the
Apple Developer CD-ROMs).
Application, etc.   Source              Comments
__________________  __________________  ______________________________________
AddressSpeech       info-mac            4D talking address book (from Speech Pack 2.0)
At Ease 2.0         MacWarehouse        Friendly desktop that speaks file names
At Ease 2.0 WG      MacWarehouse        Friendly desktop that speaks file names
Eliza 3.1           AOL                 Talking Eliza (Rogerian psych therapist)
FB speech           Inside Basic Mag, volume 3, no. 6.  FutureBasic demo
FB Speech demo      Inside Basic Mag, volume 3, no. 7.  FutureBasic demo
Fortune 1.1         info-mac            Like a talking UNIX fortune command - slick
Homer 0.92d9        zaphod.ee.pitt.edu  GUI IRC client, assign nicks voices - slick
MacMessage 1.0      FirstClassBBS       Share talking messages/customizable startup
Say                 info-mac            MPW Tool which converts standard input to speech
ScriptTools 1.2     info-mac            Write AppleScript scripts to say text messages
Siege Watch 1.01f   info-mac            Wryly political speaking clock
SoToSpeak1.0.0b10   info-mac            Two voice conversation (also see Fortune's About)
Speak It!           info-mac            Type in a message and have it spoken
Speaker 1.11        info-mac            Simple text file editor, speaks on <CR>, macros
Speecher 1.2.1      info-mac            Customizable word pronunciation/substitution
SpeechManagerdemo   info-mac            Command line interface, C source, aka -explorer
Speech Pack 2.0     info-mac            4th Dimension external, add speech to database
SpeechUnitEx        info-mac            Pascal source code for speech in Lab 7
speek-02b           info-mac            Speech XCMD for HyperCard
TalkingClockPro2.0  info-mac            AppleScriptable talking clock extension (2.0b0)
TeachText 7.2       AV Mac              Apple's talking TeachText (simple editor w/QT)
Tex-Edit 1.9        AOL                 Talking word processor, McSink like, modeming
VoiceDemo 1.0.1     info-mac            Bare bones phrase talker
Welcome!v1.3.1      info-mac            A talking Welcome to Macintosh startup
?                   ?                   Talking Plug-In-Module for MS Word 5,
                                        experimental, unsupported, buggy, beware!
Speech Rhythms      AOL                 A cool text file for one of the above apps
_____
Sources:
AOL = America Online
info-mac = {ftp sumex-aim.stanford.edu, ftp wuarchive.wustl.edu, et al.}
MacWarehouse = (800) 255-6227
Apple's work in spoken language technologies and systems is described in:
Lee, Kai-Fu. "The Conversational Computer: An Apple Perspective."
(Keynote Speech) In Proc. Eurospeech in Berlin, ESCA, September, 1993.
Package: MacinTalk
Platform: Macintosh
Cost: Free
Description: Formant based speech synthesis.
There is also a program called "tex-edit" which apparently
can pronounce English sentences reasonably using Macintalk.
Note: MacinTalk doesn't run reliably on Macintoshes with new
sound hardware under the latest OS (System 7.1 w/HUD 2.0).
More recent software is listed above.
Availability: By anonymous ftp from many archive sites (have a look on
archie if you can). tex-edit is on many of the same sites. Try
wuarchive.wustl.edu:/mirrors2/info-mac/Old/card/macintalk.hqx[.Z]
/macintalk-stack.hqx[.Z]
wuarchive.wustl.edu:/mirrors2/info-mac/app/tex-edit-15.hqx
Package: Lernout & Hauspie Text-To-Speech SDK
Platform: IBM-Compatible
Description: The L&H Text-to-Speech software developers kit is able
to integrate text-to-speech technology with your own or existing
PC applications under Microsoft Windows 3.1. This software will
allow conversion of written text into clear human sounding synthetic
speech.
Requirements: IBM-compatible PC 386 DX(33Mhz) or higher, 8Mb RAM,
MS DOS 5.0(or higher), MS Windows 3.1 (or higher),
Compiler and linker: Microsoft(R) Visual C++ or Borland C++,
Windows(TM) 3.1 compatible sound card, preferably 16 bit
e.g. Soundblaster, Windows Sounds System, Pro Audio Spectrum
Price: Unconfirmed $1,999 per copy, and $499 per each additional language
(American English, French, German, or Spanish).
Contact: USA (617) 932-4118
Package: Tinytalk
Platform: PC
Description: Shareware package is a speech 'screen reader' which is used
by many blind users.
Availability: By anonymous ftp from handicap.shel.isc-br.com.
Get the files /speech/ttexe145.zip & /speech/ttdoc145.zip.
Package: Narrator - narrator.device
Platform: Amiga
Description: Formant based speech synthesis. Includes an English-to-phoneme
translation library, and a SPEAK: pseudo-device for speech
output.
Hardware: Standard Amiga hardware
Availability: Part of AmigaOS
Product Series: Infovox
Description: Multilingual Text-to-speech systems, languages available:
American English, British English, German, French, Spanish,
Italian, Swedish, Norwegian, Icelandic, Danish and Finnish.
Product name: INFOVOX 500, PC BOARD
* Product description: Half length expansion board for IBM PC, XT, AT,
PS/2 model 30 or compatible personal computers. The board can
also be connected via the serial port. Language and control program
for downloading into RAM or mounted on EPROMs.
* Platform: for IBM PC, XT, AT, PS/2 model 30 or compatible
Product name: INFOVOX 600, OEM BOARD
* Product description: OEM board built with CMOS IC's. Language and
control program are stored in on-board fixed memory.
* Platform: any, Interface: 9-pole D-SUB (RS 232-C) 300-9600 Baud
Product name: INFOVOX 700, DESKTOP UNIT
* Product description: Desktop unit with built in Infovox 600 to be
connected to any computer or terminal via an RS 232-C serial
interface. Built in loudspeaker and rechargeable battery for 4 hours
use, and control knobs for continuous control of speech volume and
speed.
* Platform: any
Product name: INFOVOX 650, OEM BOARD
* Product description: OEM-board built with CMOS IC's. Language and
control program are stored in on-board memory.
* Platform:any, Interface: 9 pole D-SUB (RS 232-C) 300-9600 Baud
Product name: INFOVOX 750, DESKTOP UNIT
* Product description: Desktop unit with built in Infovox 650 to be
connected to any computer or terminal via an RS 232-C serial
interface. Built in loudspeaker and rechargeable battery for 5 hours
use, and a control knob for continuous control of speech volume.
* Platform: any
Misc: Infovox multi-lingual Text-to-Speech Technologies can interface with
Apple's PlainTalk System. It enables Apple Third party developers
to write application software with synthetic speech output using
their usual Apple Plain Talk Text-to-Speech interface. Software
already written for the English speaking market using Apple Plain
Talk can be now distributed worldwide, provided message strings
are translated.
Contact: TELIA PROMOTOR INFOVOX AB
TTS Sales Division
P.O. Box 2069
S-171 02 Solna, Sweden
Ph: +46 8 764 35 00 Fax: +46 8 735 78 76
email: tts-sales@infovox.se
SIMTEL-20
The following is a list of speech related software available from
SIMTEL-20 and its mirror sites for PCs.
The SIMTEL internet address is WSMR-SIMTEL20.Army.Mil [192.88.110.20]
Try looking at your nearest archive site first.
Directory PD1:<MSDOS.VOICE>
Filename      Type  Length  Date    Description
==========================================================================
AUTOTALK.ARC  B      23618  881216  Digitized speech for the PC
CVOICE.ARC    B      21335  891113  Tells time via voice response on PC
HEARTYPE.ARC  B      10112  880422  Hear what you are typing, crude voice synth.
HELPME2.ARC   B       8031  871130  Voice cries out 'Help Me!' from PC speaker
SAY.ARC       B      20224  860330  Computer Speech - using phonemes
SPEECH98.ZIP  B      41003  910628  Build speech (voice) on PC using 98 phonemes
TALK.ARC      B       8576  861109  BASIC program to demo talking on a PC speaker
TRAN.ARC      B      39766  890715  Repeats typed text in digital voice
VDIGIT.ZIP    B     196284  901223  Toolkit: Add digitized voice to your programs
VGREET.ARC    B      45281  900117  Voice says good morning/afternoon/evening
Package: Bliss
Contact: Dr. John Mertus (Brown University) Mertus@browncog.bitnet
Package: xxx
Platform: (PC, Mac, Sun, NeXt etc)
Rough Cost: (if appropriate)
Description: (keep it brief)
Hardware: (requirement list)
Availability: (ftp info, email contact or company contact)
Can anyone provide information on the following:
MultiVoice
Monolog
TrueSpeech from DSP Group Inc.
The range of recently released Windows products
Please email or post suitable information for this list. Commercial,
public domain and research packages are all appropriate.
=======================================================================
SECTION 6 - Speech Recognition
Q6.1: What is speech recognition?
Automatic speech recognition is the process by which a computer maps an
acoustic speech signal to text.
Automatic speech understanding is the process by which a computer maps an
acoustic speech signal to some form of abstract meaning of the speech.
------------------------------------------------------------------------
Q6.2: How can I build a very simple speech recogniser?
Doug Danforth provides a detailed account in article 253 in the comp.speech
archives - also available as file info/DIY_Speech_Recognition.
The first part is reproduced here.
QUICKY RECOGNIZER sketch:
Here is a simple recognizer that should give you 85%+ recognition
accuracy. The accuracy is a function of WHAT words you have in
your vocabulary. Long distinct words are easy. Short similar
words are hard. You can get 98+% on the digits with this recognizer.
Overview:
(1) Find the beginning and end of the utterance.
(2) Filter the raw signal into frequency bands.
(3) Cut the utterance into a fixed number of segments.
(4) Average data for each band in each segment.
(5) Store this pattern with its name.
(6) Collect training set of about 3 repetitions of each pattern (word).
(7) Recognize unknown by comparing its pattern against all patterns
in the training set and returning the name of the pattern closest
to the unknown.
Many variations upon the theme can be made to improve the performance.
Try different filtering of the raw signal and different processing methods.
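The overview above can be sketched in Python roughly as follows. This is an
illustration only and not Doug Danforth's code: the function names, the
amplitude threshold, the four centre frequencies, the number of segments and
the Euclidean distance are all assumptions chosen to keep the example short,
and the "filter bank" is just a direct DFT probe at each centre frequency.

```python
import math


def find_endpoints(samples, threshold=0.1):
    """Step 1: drop leading/trailing samples below an amplitude threshold."""
    loud = [i for i, s in enumerate(samples) if abs(s) > threshold]
    return samples[loud[0]:loud[-1] + 1] if loud else []


def band_energies(frame, rate, centres):
    """Steps 2 and 4: energy near each centre frequency, per sample."""
    out = []
    for f in centres:
        w = 2 * math.pi * f / rate
        re = sum(s * math.cos(w * n) for n, s in enumerate(frame))
        im = sum(s * math.sin(w * n) for n, s in enumerate(frame))
        out.append((re * re + im * im) / max(len(frame), 1))
    return out


def make_pattern(samples, rate, centres=(300, 600, 1200, 2400), segments=8):
    """Steps 1-4: one value per band for each of a fixed number of segments."""
    speech = find_endpoints(samples)
    size = max(len(speech) // segments, 1)
    pattern = []
    for k in range(segments):
        frame = speech[k * size:(k + 1) * size]
        pattern.extend(band_energies(frame, rate, centres))
    return pattern


def recognise(unknown, training):
    """Step 7: name of the stored pattern closest to the unknown."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(training, key=lambda item: dist(unknown, item[1]))[0]


def tone(freq, rate=8000, n=800):
    """Synthetic stand-in for a recorded word: a pure sine wave."""
    return [math.sin(2 * math.pi * freq * i / rate) for i in range(n)]


RATE = 8000
# Steps 5-6: store named patterns (one repetition each here, not three).
training = [("low", make_pattern(tone(300), RATE)),
            ("high", make_pattern(tone(2400), RATE))]
print(recognise(make_pattern(tone(310), RATE), training))
```

The sine waves stand in for real recordings so the sketch runs without any
audio hardware. With real speech you would store several repetitions of each
word and would probably want a proper filter bank and some normalisation of
loudness.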
------------------------------------------------------------------------
Q6.2: What does speaker dependent/adaptive/independent mean?
A speaker dependent system is developed (trained) to operate for a single
speaker. These systems are usually easier to develop, cheaper to buy and
more accurate, but are not as flexible as speaker adaptive or speaker
independent systems.
A speaker independent system is developed (trained) to operate for any
speaker or speakers of a particular type (e.g. male/female, American/English).
These systems are the most difficult to develop, most expensive and currently
accuracy is not as good. They are the most flexible.
A speaker adaptive system is developed to adapt its operation for new
speakers that it encounters usually based on a general model of speaker
characteristics. It lies somewhere between speaker independent and speaker
dependent systems.
Each type of system is suited to different applications and domains.
------------------------------------------------------------------------
Q6.3: What does small/medium/large/very-large vocabulary mean?
The size of vocabulary of a speech recognition system affects the complexity,
processing requirements and the accuracy of the system. Some applications
only require a few words (e.g. numbers only), others require very large
dictionaries (e.g. dictation machines).
There are no established definitions but the following may be a helpful guide.
small vocabulary - tens of words
medium vocabulary - hundreds of words
large vocabulary - thousands of words
very-large vocabulary - tens of thousands of words.
------------------------------------------------------------------------
Q6.4: What does continuous speech or isolated-word mean?
An isolated-word system operates on single words at a time - requiring a
pause between saying each word. This is the simplest form of recognition
to perform, because the pronunciation of the words tends not to affect each
other. Because the occurrences of each particular word are similar they are
easier to recognise.
A continuous speech system operates on speech in which words are connected
together, i.e. not separated by pauses. Continuous speech is more difficult
to handle because of a variety of effects. First, it is difficult to find
the start and end points of words. Another problem is "coarticulation".
The production of each phoneme is affected by the production of surrounding
phonemes, and similarly the start and end of words are affected by the
preceding and following words. The recognition of continuous speech is also
affected by the rate of speech (fast speech tends to be harder).
------------------------------------------------------------------------
Q6.5: How is speech recognition done?
A wide variety of techniques are used to perform speech recognition.
There are many types of speech recognition. There are many levels of
speech recognition/processing/understanding.
Typically speech recognition starts with the digital sampling of speech.
The next stage would be acoustic signal processing. Common techniques
include a variety of spectral analyses, LPC analysis, the cepstral transform,
cochlea modelling and many, many more.
The next stage will typically try to recognise phonemes, groups of phonemes
or words. This stage can be achieved by many processes such as DTW (Dynamic
Time Warping), HMM (hidden Markov modelling), NNs (Neural Networks), and
sometimes expert systems. In crude terms, all these processes try to recognise
the patterns of speech. The most advanced systems are statistically
motivated.
Some systems utilise knowledge of grammar to help with the recognition
process.
Some systems attempt to utilise prosody (pitch, stress, rhythm etc) to
process the speech input.
Some systems try to "understand" speech. That is, they try to convert the
words into a representation of what the speaker intended to mean or achieve
by what they said.
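As an illustration of one of the pattern matching techniques named above,
here is a minimal sketch of DTW (Dynamic Time Warping) in Python. It assumes
each utterance has already been reduced to a sequence of single numbers (real
systems compare feature vectors such as spectral or cepstral frames), and the
function name and the absolute-difference local cost are choices made for
this example only.

```python
def dtw_distance(a, b):
    """Cumulative cost of the best time alignment between sequences a and b.

    cost[i][j] holds the cheapest way to align the first i items of a with
    the first j items of b, allowing either sequence to be stretched.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # stretch b
                                     cost[i][j - 1],      # stretch a
                                     cost[i - 1][j - 1])  # step both
    return cost[n][m]


slow = [1, 1, 2, 2, 3, 3]   # the same "word" spoken slowly
fast = [1, 2, 3]            # ... and quickly
other = [3, 2, 1]           # a different "word"

print(dtw_distance(slow, fast))   # 0.0
print(dtw_distance(slow, other))
```

The point of the warping is that the slow and fast versions match exactly
even though they differ in length, while the different pattern does not. A
template-based recogniser would compute this distance against every stored
word and pick the smallest.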
------------------------------------------------------------------------
Q6.6: What are some good references/books on recognition?
Some general introduction books on speech recognition:
Fundamentals of Speech Recognition; Lawrence Rabiner & Biing-Hwang Juang
Englewood Cliffs NJ: PTR Prentice Hall (Signal Processing Series), c1993
ISBN 0-13-015157-2
Speech recognition by machine; W.A. Ainsworth
London: Peregrinus for the Institution of Electrical Engineers, c1988
Speech synthesis and recognition; J.N. Holmes
Wokingham: Van Nostrand Reinhold, c1988
Douglas O'Shaughnessy -- Speech Communication: Human and Machine
Addison Wesley series in Electrical Engineering: Digital Signal Processing,
1987.
Electronic speech recognition: techniques, technology and applications
edited by Geoff Bristow, London: Collins, 1986
Readings in Speech Recognition; edited by Alex Waibel & Kai-Fu Lee.
San Mateo: Morgan Kaufmann, c1990
More specific books/articles:
Hidden Markov models for speech recognition; X.D. Huang, Y. Ariki, M.A. Jack.
Edinburgh: Edinburgh University Press, c1990
Automatic speech recognition: the development of the SPHINX system;
by Kai-Fu Lee; Boston; London: Kluwer Academic, c1989
Prosody and speech recognition; Alex Waibel
(Pitman: London) (Morgan Kaufmann: San Mateo, Calif) 1988
S. E. Levinson, L. R. Rabiner and M. M. Sondhi, "An Introduction to the
Application of the Theory of Probabilistic Functions of a Markov Process
to Automatic Speech Recognition" in Bell Syst. Tech. Jnl. v62(4),
pp1035--1074, April 1983
R. P. Lippmann, "Review of Neural Networks for Speech Recognition", in
Neural Computation, v1(1), pp 1-38, 1989.
------------------------------------------------------------------------
Q6.7: What speech recognition packages are available?
Information is included below on the following packages:-
Voice Blaster Ver. 4.0
Votan
HTK (HMM Toolkit)
DragonDictate
VoiceServer for Windows
IN3 Voice Command for Windows
IN3 Voice Command
SayIt
Recnet
Voice Command Line Interface
DATAVOX
Package Name: Voice Blaster Ver. 4.0
Platform: IBM AT or higher, DOS or Windows 3.1
Description: Uses a Sound Blaster or compatible board. Contains a
microphone headset and a connector for LPT1:. A printer can
still be used on LPT1:. Will recognize 1024 words that are
trained by the operator. Each word activates a macro that can
enter an ascii word on the screen or into a word processor or
invoke a batch file. An optional footswitch may be installed.
Software to run under DOS or Windows 3.1 is included.
Cost: Around $150 Canadian.
Contact: COVOX Inc.
675 Conger Street
Eugene, Oregon
97402
Ph: (503) 342-1271 Fax: (503) 342-1283
BBS: (503) 342-4135
Package Name: Votan
Platform: MS-DOS, SCO UNIX
Description: Isolated word and continuous speech modes, speaker dependent
and (limited) speaker independent. Vocab size is 255 words or up to a
fixed memory limit - but it is possible to dynamically load different
words for effectively unlimited number of words.
Rough Cost: Approx US $1,000-$1,500
Requirements: Cost includes one Votan Voice Recognition ISA-bus board
for 386/486-based machines. A software development system is also
available for DOS and Unix.
Misc: Up to 8 Votan boards may co-exist for 8 simultaneous voice users.
A telephone interface is also available. There is also a 4GL and a
software development system.
Apparently there is more than one version - more info required.
Contact: 800-877-4756, 510-426-5600
Package Name: HTK (HMM Toolkit) - From Entropic
Platform: Range of Unix platforms.
Description: HTK is a software toolkit for building continuous density HMM
based speech recognisers. It consists of a number of library
modules and a number of tools. Functions include speech analysis,
training tools, recognition tools, results analysis, and an
interactive tool for speech labelling. Many standard forms of
continuous density HMM are possible. Can perform isolated word or
connected word speech recognition. It can model whole words or
sub-word units. Can perform speaker verification and other pattern
recognition work using HMMs. HTK is now integrated with the
ESPS/Waves speech research environment which is described in
Section 1.8 of this posting.
Misc: The availability of HTK changed in early 1993 when Entropic obtained
exclusive marketing rights to HTK from the developers at Cambridge.
Cost: On request.
Contact: Entropic Research Laboratory, Washington Research Laboratory,
600 Pennsylvania Ave, S.E. Suite 202, Washington, D.C. 20003
(202) 547-1420. email - info@wrl.epi.com
Package Name: DragonDictate-30K
Platform: PC
Description: Speaker dependent/adaptive system requiring words to be
separated by short pauses. Vocabulary of 25,000 words including
a "custom" word set.
Rough Cost: $5000
Requirements: Minimum of 20 MHz 386 with 8M memory and 10M disk space
Contact: Dragon Systems Inc.
90 Bridge Street, Newton MA 02158
Tel: 1-617-965-5200, Fax: 1-617-527-0372
Package Name: VoiceServer for Windows
Platform: PC
Description: Speaker dependent, each user with an independent directory.
Isolated word. Up to 1000 words/user, 300 words/window.
1 word occupies 2Kb on hard disk.
Can be used to control Windows applications by issuing
voice commands instead of menu selection.
Rough Cost: 292 Pounds(UK)
Requirements: None
Misc: Price includes a half-sized AT voice card (including a
DSP), software, documentation & a microphone (attachable to
keyboard or speaker). A light-weight high-spec headset is an
optional extra.
Contact: Mark Redwood
Applied Voice Technologies
26 Danbury Street, Islington,
London, UK, N1 8JU
Ph: + 44 71 454 1224 : Fax: + 44 71 454 1225
Package Name: IN3 Voice Command for Windows
Platform: PC with Windows 3.1
Description: IN3 is now available for MS-Windows. Users can call
applications to the foreground with voice commands. Once the
application is called, the user may enter commands and data with
voice commands. Voice macros can reduce the strain of repetitive
stress injuries (RSI) such as Carpal Tunnel Syndrome (CTS) by
replacing heavy repetitive keyboard hammering with simple voice
operations. Voice macros take complex operations and reduce them
to simple verbal commands. Voice input can provide new facilities
for tasks which could not otherwise easily have been performed
without multiple axes of input. IN3 is hardware-independent:
users with any Windows-compatible audio hardware can add speech
recognition to the desktop. IN3 works with either 8 bit or 16 bit Windows audio
boards. IN3 is based on continuous word-spotting technology. A
developer API is also available for creating voice-enabled
applications.
Price: $179 U.S.
Requirements: PC with 80386 processor or better, Microsoft Windows 3.1, and
Windows compatible audio system with microphone.
Misc: Fully functional demos are available on Compuserve in various
Multimedia and CAD forums. Demos are also available from "America
on Line", the comp.binaries.ms-windows archive sites, and various
BBS systems. It is also available by anonymous ftp as
ftp.wustl.edu:/usenet/comp.binaries.ms-windows/v3/in3demo.zip
ftp.uwasa.fi:/mirror/ultrasound/demo/in3demo.zip
An equivalent Sun product is described below.
Contact: Brantley Kelly
Email: cbk@gacc.atl.ga.us CIS: 75120,431
FAX: 1-404-925-7924 Phone: 1-404-925-7950
Command Corp. Inc, 3675 Crestwood Parkway, Duluth GA 30136, USA
Package Name: IN3 Voice Command
Platform: Sun SPARCstation
Description: IN3 provides a secure, robust, word spotting, continuous
speech recognition facility for the Sun OS or Solaris operating
systems. The recognition system is a secure operating system
facility capable of working with various interfaces, microphones,
and devices. The operating system interface works with native UNIX
outside of X Windows and also provides enhanced X Windows facilities,
including named window support. The user interface provides a
means to quickly create commands on the fly for replacing long strings
and complex operations with voice macros. [Voice macros can reduce
the strain of repetitive stress injuries (RSI) such as Carpal Tunnel
Syndrome (CTS) by replacing heavy repetitive keyboard hammering with
simple voice operations. ]
The IN3 user interface works with generic X servers and window
managers. A developer API is also available for creating voice-
enabled applications, interfacing with other audio sources, and
providing extensive application control over the recognition facility.
Availability: SunSite archive at SunSITE.unc.edu as well as on Catalyst
CDware as both a runnable demo and unlockable software.
Hardware Required: Sun SPARCstation with audio input.
Noise canceling microphone recommended but not required.
Software Required: Sun OS 4.1.2 with OpenWindows 3.0 or
Sun OS 4.1.3 or
Solaris 2.1 or Solaris 2.2
Misc: An equivalent MS-Windows product is described above.
Price: $495 U.S.
Contact: Brantley Kelly
Email: cbk@gacc.atl.ga.us CIS: 75120,431
FAX: 1-404-925-7924 Phone: 1-404-925-7950
Command Corp. Inc, 3675 Crestwood Parkway, Duluth GA 30136, USA
Package Name: Phonetic Engine 400 (PE400) - Speech Systems, Inc.
Platform: PC
Description: Speaker independent, large vocabulary, continuous speech
recognition for MS Windows or DOS.
Rough Cost: $1195 US dollars. Includes board, microphone, developer kit,
documentation, 2 days of technical training and 90 days of
technical support.
Requirements: IBM AT class machine or better plus 5M disk space. Most
processing is performed on-board (4M standard or 16M upgrade).
Misc: Requires developer to provide a context-free grammar.
Vocabulary size unknown (quoted figures range from 500 to 2000 words per grammar),
but dynamic grammar switching capabilities may increase the
effective vocabulary size.
Development system includes lower-level C/C++ library (VoiceLib),
higher-level DLL (SPOT) callable from many languages, SPOT/VBX,
a custom control for Visual Basic and Visual C++.
Contact: Speech Systems, Inc.
2945 Center Green Court South
Boulder, CO 80301-2275, USA
Tel: 303.938.1110 Fax: 303.938.1874
Package Name: SayIt
Platform: Sun SPARCstation
Description: Voice recognition and macro building package for Suns
in the OpenWindows 3.0 environment. Speaker dependent discrete speech
recognition. Vocabularies can be associated to applications and the
active vocabulary follows the application that has input focus.
Macros can include mouse commands, keystrokes, Unix commands,
sound, OpenWindows actions and more.
An evaluation copy is available by email.
Hardware: Microphone required (SunMicrophone is fine).
Cost: $US295
Contact: Phone: 1-800-245-UNIX or 1-415-572-0200
Fax: 1-415-572-1300
Email: info@qualix.com
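The "active vocabulary follows the application that has input focus"
behaviour described above can be pictured with a small sketch. This is
not SayIt's implementation or API; the application names, the command
words, and the always-active GLOBAL_WORDS set are all assumptions made
up for illustration.

```python
# Hypothetical per-application vocabularies; names and words are invented.
VOCABULARIES = {
    "mailtool": {"compose", "reply", "delete", "next"},
    "cmdtool": {"list", "clear", "exit"},
}

# Assumption: a few words stay active whichever window has focus.
GLOBAL_WORDS = {"help"}


def active_vocabulary(focused_app):
    """Return the set of words the recogniser should listen for."""
    return GLOBAL_WORDS | VOCABULARIES.get(focused_app, set())


def accept(word, focused_app):
    """Return the word if it is active for the focused application, else None."""
    return word if word in active_vocabulary(focused_app) else None
```

Restricting the search to the focused application's words keeps the
active vocabulary small, which generally helps a speaker dependent
discrete recogniser, while the total number of trained words can grow.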
Package Name: recnet
Platform: UNIX
Description: Speech recognition for the speaker independent TIMIT and
Resource Management tasks. It uses recurrent networks to estimate
phone probabilities and Markov models to find the most probable
sequence of phones or words. The system is a snapshot of evolving
research code. There is no documentation other than published
research papers. The components are:
1. A preprocessor which implements many standard and many non-
standard front end processing techniques.
2. A recurrent net recogniser and parameter files
3. Two Markov model based recognisers, one for phone recognition
and one for word recognition
4. A dynamic programming scoring package
The complete system performs competitively.
Cost: Free
Requirements: TIMIT and Resource Management databases
Contact: ajr@eng.cam.ac.uk (Tony Robinson)
Availability: by FTP from "svr-ftp.eng.cam.ac.uk" as /misc/recnet-1.3.tar.Z
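The recnet description above mentions Markov models that find the most
probable sequence of phones given per-frame phone probabilities. A
standard way to do this is Viterbi decoding; the following is a generic
textbook sketch, not code taken from recnet. The function name, the
argument layout, and the use of plain Python lists are assumptions, and
the per-frame scores are simply taken as given (in recnet they would
come from the recurrent network).

```python
import math


def viterbi(frame_log_probs, log_trans, log_init):
    """Return the most probable state (phone) index sequence.

    frame_log_probs[t][j]: log score of phone j at frame t
    log_trans[i][j]:       log probability of moving from phone i to phone j
    log_init[j]:           log probability of starting in phone j
    """
    n_states = len(log_init)
    # score[j] = best log score of any path ending in state j at this frame
    score = [log_init[j] + frame_log_probs[0][j] for j in range(n_states)]
    backptrs = []
    for frame in frame_log_probs[1:]:
        new_score, ptr = [], []
        for j in range(n_states):
            # Best predecessor for state j
            best_i = max(range(n_states),
                         key=lambda i: score[i] + log_trans[i][j])
            new_score.append(score[best_i] + log_trans[best_i][j] + frame[j])
            ptr.append(best_i)
        score = new_score
        backptrs.append(ptr)
    # Trace back from the best final state
    state = max(range(n_states), key=lambda j: score[j])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path
```

With "sticky" transition probabilities the decoder smooths over a single
frame whose local score briefly favours another phone, which is the main
reason for decoding whole sequences rather than picking the best phone
frame by frame.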
Package Name: Voice Command Line Interface
Platform: Amiga
Description: VCLI will execute CLI commands, ARexx commands, or ARexx
scripts by voice command through your audio digitizer. VCLI allows
you to launch multiple applications or control any program with an
ARexx capability entirely by spoken voice command. VCLI is fully
multitasking and will run in the background, continuously listening
for your voice commands even while other programs are running.
Documentation is provided in AmigaGuide format.
VCLI 6.0 runs under either Amiga DOS 2.0 or 3.0.
Cost: Free?
Requirements: Supports the DSS8, PerfectSound 3, Sound Master, Sound Magic,
and Generic audio digitizers.
Availability: by ftp from wuarchive.wustl.edu in the file
systems/amiga/incoming/audio/VCLI60.lha and from
amiga.physik.unizh.ch as the file pub/aminet/util/misc/VCLI60.lha
Contact: Author's email is RHorne@cup.portal.com
Package Name: DATAVOX - French
Platform: PC
Description: Continuous speech - speaker independent or dependent.
Rough Cost: ?
Requirements: 2 PC format boards (RdF1000 and TdS 96/25) and an
A/D - D/A module (ASA116)
Misc: Application software may communicate with DATAVOX through two types
of interface:
1) Keyboard overlay
The application software may be used with any PC compatible
package. No specific adaptation is necessary, you only need
to define your configuration with the application software.
2) C library
Allows a user-written program to drive the recognition system.
DATAVOX is based on the AMADEUS speech recognition software
developed at LIMSI. It provides
- Continuous speech recognition with
* speaker dependent : 500 words
* speaker independent : 50 words (custom-made vocabulary).
- Grammar of the application language (syntax acquisition,
verification and simplification software).
- Large vocabulary : DATAVOX can recognize vocabularies of several
thousand words as long as there are no more than 500 words in the
active vocabulary at any given node. It takes less than 1 second
to change syntax and vocabulary.
- Training controlled by the system (use of co-articulation models).
- Response time less than 500 ms for any phrase length.
- Synthesis (ADPCM) can be heard while recognition is being
carried out.
Contact: VECSYS, Le Chene rond, 91570 Bievres, France
Fax: 33 1 69 41 24 30
Voice: 33 1 69 41 15 04
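The DATAVOX entry above says the total vocabulary may reach several
thousand words provided no more than 500 are active at any given node
of the grammar. The sketch below illustrates that idea only; it is not
the DATAVOX C library. The Grammar class, its method names, and the
example words are hypothetical.

```python
MAX_ACTIVE_WORDS = 500  # per-node limit quoted in the entry above


class Grammar:
    """A grammar as a graph: each node maps an allowed word to the next node."""

    def __init__(self):
        self.nodes = {}  # node name -> {word: next node name}

    def add_node(self, name, transitions):
        if len(transitions) > MAX_ACTIVE_WORDS:
            raise ValueError("too many active words at node %r" % name)
        self.nodes[name] = dict(transitions)

    def active_words(self, node):
        """Words the recogniser has to consider at this node."""
        return set(self.nodes[node])

    def total_vocabulary(self):
        """All words used anywhere in the grammar (may exceed the node limit)."""
        return set().union(*(t.keys() for t in self.nodes.values()))

    def follow(self, start, words):
        """Return the node reached after the words, or None if not allowed."""
        node = start
        for word in words:
            if word not in self.nodes[node]:
                return None
            node = self.nodes[node][word]
        return node
```

A possible use, with invented words: a "start" node accepting "open" or
"close" leads to an "object" node accepting "door" or "window". Only two
words are active at each step although the grammar knows four.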
Package: PowerSecretary
Platform: Mac
Price: $US5,000 (including a Centris or Quadra AV)
Availability: Articulate Systems Inc.
600 W. Cummings Park, Suite 4500
Woburn, MA 01801
Ph: (617) 935-5656 Fax: (617) 935-0490.
Package: ICSS system from IBM
Description: A large vocabulary, speaker independent, continuous speech
system which runs under Windows, OS/2, and AIX.
Requirements: Soundboard (e.g. Soundblaster)
Price: ?
Contact: ?
Package: Creative VoiceAssist
Platform: PC (?)
Price: $US99.95
Contact: Creative Labs
Ph: 1-800-998-5227
Package Name: xxx
Platform: PC, Mac, UNIX, Amiga ....
Description: (e.g. isolated word, speaker independent...)
Rough Cost: (if applicable)
Requirements: (hardware/software needs - if applicable)
Misc:
Contact: (email, ftp or address)
Can anyone provide info on
Verbex Listen for Windows
Voice Navigator (from Articulate Systems)
SRI Recognisers
BBN Recognisers
Can you provide information on any other software/hardware/packages?
Commercial, public domain and research packages are all appropriate.
Andrew Hunt
Speech Technology Research Group Ph: 61-2-692 4509
Dept. of Electrical Engineering Fax: 61-2-692 3847
University of Sydney, NSW, 2006, Australia email: andrewh@speech.su.oz.au